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Abstract 

We develop a simulation-based method for the online updating of Gaussian process 
regression and classification models. Our method exploits sequential Monte Carlo to 
produce a fast sequential design algorithm for these models relative to the established 
MCMC alternative. The latter is less ideal for sequential design since it must be 
restarted and iterated to convergence with the inclusion of each new design point. We 
illustrate some attractive ensemble aspects of our SMC approach, and show how active 
learning heuristics may be implemented via particles to optimize a noisy function or 
to explore classification boundaries online. 

Key words: Sequential Monte Carlo, Gaussian process, nonparametric regression and 
classification, optimization, expected improvement, sequential design, entropy 

1 Introduction 

The Gaussian process (GP) is by now well established as the backbone of many highly flexible 
and effective nonlinear regression and classification models (e.g., Neal, 1998 Rasmussen and 



Williams 2006). One important application for GPs is in the sequential design of computer 



experiments (Santner et al. , 2003) where designs are built up iteratively: choose a new design 



point X according to some criterion derived from a GP surrogate model fit; update the fit 
conditional on the new pair (x, y{x))] and repeat. The goal is to keep designs small in order 
to save on expensive simulations of y{x). By "fit" we colloquially mean: samples obtained 
from the GP posterior via MCMC. While it is possible to choose each new design point 
via full utility-based design criterion (e.g., Miiller et al. , 2004), this can be computationally 



daunting even for modestly sized designs. More thrifty active learning (AL) criterion such 



as ALM (MacKay, 1992) and ALC (Cohn, 1996) can be an effective alternative. These were 



first used with GPs by Seo et al. (2000), and have since been paired with a non-stationary 
GP to design a rocket booster (Gramacy and Lee, 2009). 

Similar AL criteria are available for other sequential design tasks. Optimization by 



expected improvement (EI, Jones et al. , 1998) is one example. Taddy et al. (2009) used 
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an embellished EI with a non-stationary GP model and MCMC inference to determine the 
optimal robust configuration of a circuit device. In the classification setting, characteristics 



like the predictive entropy ( Joshi et al. , 2009) can be used to explore the boundaries between 
regions of differing class label in order to maximize the information obtained from each new 
X. The thrifty nature of AL and the flexibility of the GP is a favorable marriage, indeed. 
However, a drawback of batch MCMC-based inference is that it is not tailored to the online 
nature of sequential design. Except to guide the initialization of a new Markov chain, it is not 
clear how flts from earlier iterations may re-used in search of the next x. So after the design 
is augmented with {x,y{x)) the MCMC must be restarted and iterated to convergence. 

In this paper we propose to use a sequential Monte Carlo (SMC) technique called particle 
learning (PL) to exploit the analytically tractable (and Rao-Blackwellizable) GP posterior 
predictive distribution in order to obtain a quick update of the GP flt after each sequential 
design iteration. We then show how some key AL heuristics may be efficiently calculated 
from the particle approximation. Taken separately, SMC/PL, GPs, and AL, are by now well 
established techniques in their own right. Our contribution lies in illustrating how together 
they can be a potent mixture for sequential design and optimization under uncertainty. 



The remainder of the paper is outlined as follows. Section LI describes the basic elements 



of GP modeling. Section [L2 reviews SMC and PL, highlighting the strengths of PL in our 
setting. Section [2] develops a PL implementation for GP regression and classification, with 
illustrations and comparisons to MCMC. We show how fast updates of particle approxima- 
tions may be used for AL in optimization and classification in Section [3| and we conclude 
with a discussion in Section |4j Software implementing our methods, and the specific code for 



our illustrative examples, is available in the plgp package (Gramacy 2010) for R on GRAN 



1.1 Gaussian process priors for regression and classification 

A GP prior for functions F : — )■ M, where any finite collection of outputs are jointly 



Gaussian (Stein, 1999), is defined by its mean fi{x) = K{Y{x)} = f{x)~^(3 and covariance 
C{x,x') = K{[Y{x) — fi{x)][(Y {x') — fi{x')]^]}. Often the mean is linear in the inputs 
{f{x) = [l,x]) and (3 is an unknown (p + 1) x 1 parameter vector. Typically, one separates 
out the variance in C{x,x') and works with correlations K{x,x') = a^'^C{x,x') based 



on Euclidean distance and a small number of unknown parameters; see, e.g., [Abrahamsen 



(1997). In the classical inferential setting we may view the GP "prior" as a choice of model 
function, thus resembling a likelihood. 

In the regression problem, the likelihood of data Dn = {Xn, Yn), where Xn is a. N x p 
design matrix and Yn is a x 1 response vector, is multivariate normal (MVN) for Y^ 
with mean fi{Xj^) = F^P, where is a x (p + 1) matrix that contains f{xiY in its 
rows, and covariance S(Xjv) = a'^Kj^, where is the N x N covariance matrix with {ijY^ 
entry K{xi,Xj). To reduce clutter, we shall drop the A^ subscript when the context is clear. 
Conditional on K, the MLE for /3 and cr^ is available in closed form via weighted least 
squares. The profile likelihood may be used to infer the parameters to K{-, ■) numerically. 

Bayesian inference may proceed by specifying priors over /3, cr^, and the parameters to 
K{-,-). With priors /3 oc 1 and o"^ ~ IG(a/2,6/2), the marginal posterior distribution for 
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K{-^ •), integrating over /3 and cr^, is available in closed form (Gramacy, 2005, Section A. 2): 

a-\-N — p 



K\J (27r)^r[a/2] 



^ = Y^K-^Y - (3'^V^'^(3, (3 = V,3iF'^K-^Y), Vp = {F'^K-^F)-\ (2) 

It is possible to use a vague scale-invariant prior (a, 6 = 0) for o"^. In this case, the marginal 
posterior ([T]) is proper as long as > p+1. Mixing is generally good for Metropolis-Hastings 
(MH) sampling as long as K{-, ■) is parsimoniously parameterized, N is large, and there is 
a high signal-to-noise ratio between X and Y. Otherwise, the posterior can be multimodal 



(e.g., Warnes and Ripley, 1987) and hard to sample. 

Crucially for our SMC inference via PL [Section [2] , and for our AL heuristics [Section 
3], the fully marginalized predictive equations for CP regression are available in closed form. 
Specifically, the distribution of the response Y{x) conditioned on data D and covariance 
K{-, ■), i.e., p{y{x)\D, K), is Student-t with degrees of freedom v = N — p — 1, 

mean y{x\D, K) = fix^ ^ + k'^ {x)K~\Y - F^), (3) 



.2. in {b + ^)[K{x,x) - k'^{x)K-'^k{x)] 



and scale a (x\D,K) = . (4) 

where k^ [x) is the A^-vector whose i"" component is K{x,Xi). 

In the classification problem, with data D = {X,C), where C is a A^ x 1 vector of 
class labels q G {1, . . . ,M}, the CP is used M-fold as a prior over a collection of M x 
A^ latent variables 3^ = {^{m)}m=i) set for each class. For a particular class m, the 
generative model (or prior) over the latent variables is MVN with mean /i(m)(X) and variance 
S(m)(^), as in the regression setup. The class labels then determine the likelihood through 
the latent variables under an independence assumption so that p{C]\f\y) = YliLiPiy where 



Pi = PiC{xi) = Ci\yi). Neal (1998) recommends a softmax specification: 



P{c\yil:M)) = ^ T- (5) 

The M X N latents y add many degrees of freedom to the model, enormously expanding 
the parameter space. A proper prior (a, 6 > 0) for a^^-^ is required to ensure a proper 
posterior for all A^. There is little benefit to allowing a linear mean function, so it is typical 
to take /(x) = 0, and thus p = 0, eliminating from the model. Conditional on the Y(^rn), 
samples from the posterior of the parameters to the rri^^ CP may be obtained as described 
above via E q. Given parameters, several schemes may be used to sample 3^ via Eqs. 
[see Section [2.2|. The predictive distribution, required for our SMC/PL algorithm, is more 



involved [also deferred to Section 2.2 . Almost irrespective of the details of implementation, 
inference for CP classification is much harder than regression. In practice, only A^ x (M — 1) 
latents, and thus M — 1 CPs, are necessary since we may fix Y(m) = 0, say, without loss 
of generality. Although having fewer latents makes inference a little easier, it introduces an 
arbitrary asymmetry in the prior which may be undesirable. To simplify notation we shall 
use M throughout, although in our implementations we use M — 1. 
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1.2 Sequential Monte Carlo 



Sequential Monte Carlo (SMC) is an alternative to MCMC that is designed for online infer- 
ence in dynamic models. In SMC, particles {Sl.^^}^i containing the sufficient information 
about all uncertainties given data = {zi, . . . , zt) up to time t are used to approximate the 
posterior distribution: {s\^''}f^i ~ p{St\z^). In Section [2 we describe the sufficient informa- 
tion St for our CP regression and classification models. The key task in SMC inference is to 
update the particle approximation from time t to time t + 1. 

Our preferred SMC updating method is particle learning (PL, e.g. 



Carvalho et al. 2008) 



due to the convenient form of the posterior predictive distribution of GP models. The PL 
update is derived from the following decomposition. 

p{St+r\z'+^) = j j9(Si+i|5i,^i+i) dF{St\z'+^) oc j p{St+,\SuZt+Mzt+i\St) dF{Sty/) 

This suggests a two-step update of the particle approximation: 

1. resample the indices {i}^i with replacement from a multinomial distribution where 
each index has weight Wi oc p{zt+i\sl^^) = J p{zt+i\St+i)p{St+i\St) dSt+i, thus obtaining 



new indices {C(i)}^^ 



2. propagate with a draw from sji^^ ~ p{St+i\S^^^' , Zf^i) to obtain a new collection of 



particles {Sj^^},^! p{St+i\z 
The core components of PL are not new to the SMC arsenal. Early examples of related 



propagation methods include those of Kong et al. (1994), with resampling and the prop 



agation of sufficient statistics by Liu and Chen (1995, 1998), and look-ahead by Pitt and 



Shephard ( 1999 ). Like many SMC algorithms, PL is susceptible to an accumulation of Monte 



Carlo error with large data sets. However, two aspects of our setup mitigate these concerns 
to a large extent. Firstly, the over-arching goal of sequential design is to keep data sets as 
small as possible. CPs scale poorly to large data sets anyways, regardless of the method 
of inference (SMC, MCMC, etc.), so drastically different approaches are recommended for 
large-scale sequential design. Secondly, we only use vague priors for parameters which can be 
analytically integrated out in the posterior predictive — the main workhorse of PL — so that 
there is no need to sample them. In this way we extend the class of models for which SMC 
algorithms apply. However, we note that in order to use vague priors we must initialize the 
particles at some time to > 0. Further explanation and development is provided in Section 

m 



2 Particle Learning for Gaussian processes 

To implement PL for CPs we need to: identify the sufficient information Sf, initialize the 
particles; derive p{zt+i\St) for the resample step; and determine p{St+i\St, Zt+i) for the prop- 
agate step. We first develop these quantities for GP regression and then extend them to 
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classification. Although GPs are not dynamic models, we will continue to index the data 
size, which was N in Dn in Section [lT} with t in the SMC framework so that z*' = D^. We 
use N for the number of particles. As GPs are nonparametric priors, their sufficient informa- 
tion has size in fl{t), i.e., they depend upon the full z*. For example, the covariance ^{Xt) 
typically requires maintaining O(t^) quantities to store the distances between the pairs of 
rows in X^. Therefore is tacitly part of the sufficient information Sf. 



2.1 Regression 

Sufficient Information: Recall that zt = {Xt, Yt) in the regression setup. From our 
discussion in Section 1.1, the sufficient information, St, needed for GP regression comprises 
only of the parameters of K{-,-), defining via the pairs of rows in Xt. All of the other 
necessary quantities {(3t = f3{Kt), and ipt = i^{.Kt)) may be calculated directly from Kt and 
0*. However, we prefer to think of the sufficient information as St = {Kt, fit, '4't} for a clearer 
presentation and efficient implementation. 

Initialization: Particle initialization depends upon the choice of prior for a^. With a 
proper prior (a, 6 > 0) we may initialize the N particles at time to = with a sample of the 

K{-, ■) parameterization from its prior, K^^ ~ tt{K). We then calculate Pq\iPq^ from Kq'' 
following Eq. (|2), thereby obtaining Sq \ To take an improper prior (a, 6 = 0) on requires 
initializing the particles conditional upon > p + 1 data points to ensure a proper posterior. 
In other words, we must start the SMC algorithm at a time later than zero. We find that a 
MH scheme for obtaining K^^^l ~ p{K\z^°), via proposals from the prior 'k{K) and accepting 
via Eq. ([T]), works well for to small (i.e., not much larger than p+ 1) since the prior is similar 



to the posterior {p{K\z^'^)) in this case. We may then calculate Pto,'^to 



(0 jXi) 



from K^^^ and z*" 



following Eq. (2), thereby obtaining 5*^*^ Both approaches (proper or improper prior for cr^) 
require a proper prior on the parameters to K{-, ■). Sensible defaults exist for many of the 
typical choices for K{-, ■). One such choice is suggested in our illustration, to follow shortly. 

Resample: Technically, calculating the weights for the resample step requires integrating 
over p{St+i\St). But since the GP is not a dynamic model we can only talk about St+i 
conditional on Zt+i = {xt+i, Ut+i)- Therefore, p{zt+i\sf'') is just the probability of yt+i under 



the Student-t (3 



4) given S^ 



oc p{y{xt+i)\z\Kf^) = p{y{xt+i)\z\ K, 



Propagate: The propagate step updates each resampled sufficient information 5*^^^*^ to 
account for Zt+i = {xt+i,yt+i)- Since the parameters to K{-,-) are static, i.e., they do not 
change in t, they may by propagated deterministically by copying them from S'f*^*'* to Sjf^i- 
We note that, as a matter of efficient bookkeeping, it is the correlation matrix Kt+i and its 
inverse Kf'_^^ that are required for our PL update, not the values of the parameters directly. 

xt+i, Xj), for j = 1, . . . , t + 1 as 



The new is built from fsff ^ and K'^^ 



(i) 
t+1 



k'l'^'^ (xt+i) K^'\xt+i,xt+i) 



kf\xt. 
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The partition inverse equations yield [kI^Ji] 



m 0{t^) rather than 0{t^ 



(Kit) 



+ 9l 



[xt+i)gf''^{xt+i)/fi'i\xt+i)] 



9t i^t+i) 



where g{x) 



-fi{x)K ^k{x) and = [K{x,x) 



{x)K-'^k{x)]-\ Using Eq. (g to 
calculate and V'l+il-^t+i) takes time in O(t^). It is possible to update these 



quantities in 0{max{t,p}) time (e.g., [Escobar and Moserj 1993) from their counterparts in 
Sf*^*"* and the new Zt-^-i. However, this would not improve upon the overall complexity of the 
propagate step so we prefer the simpler expressions ([2]). 

Deterministically copying K{-, ■) in the propagate step is fast, but it may lead to particle 
depletion in future resample steps. An alternative is to augment the propagate with a sample 
from the posterior distribution via MCMC to rejuvenate the particles (e.g., MacEachern 



et al. 1999 Gilks and Berzuini 2001[ ). In our regression GP context, just a single MH step 



for the parameters to K{-, ■) using Eq. Q, for each particle, suffices. The particles represent 
"chains" in equilibrium so it is sensible to tune the MH proposals for likely acceptance by 
making their variance small, initially, relative to the posterior at the starting time t = to, and 
then further decreasing it multiplicatively as t increments. Such MH rejuvenations position 
the propagate step as a local maneuver in the Monte Carlo method, whereas resampling via 
the predictive is a more global step. Together they can emulate an ensemble method. 



An illustration: In our illustrations we follow Gramacy and Lee (2008) and take K(-,-) 
to have the form K{x,x'\g) = K*{x,x') + gSx^x', where S.^. is the Kronecker delta function, 
and —1 < K*{x,x') < 1. The g term, referred to as the nugget, must be positive and 
provides a mechanism for introducing measurement error into the stochastic process. It 
causes the predictive equations to smooth rather than interpolate, encoding (Gramacy 
2005, Appendix B) the process Y{x) = ji{x) + e{x) + rj, where /x is the mean, e is the GP 
covariance structure {a'^ K* {■,■)) , and rj is the noise process (cr^g)- We take K*{-,-) to be 
an isotropic Gaussian correlation function with unknown range parameter d: K*{x,x'\d) = 
exp {\\x — x'W^ /d}. Upon scaling the inputs {X) to lie in [0,1]^ and the outputs {Y) to 
have a mean of zero and a range of one, it is easy to design priors for d and g since the 
range of plausible values is greatly restricted. We use Exp(A = 5) for both parameters 
throughout. Random walk MH proposals from a uniform "positive sliding window" centered 
around the previous setting works well for both parameters. E.g., d* ~ U nif {id /u,ud/i) for 
u> i > 0. The setting {u,i) = (4,3) is a good baseline (Gramacy, 2007). When using MH 
for rejuvenation one may increase u and i with t to narrow the locality. 
Consider the 1-d synthetic sinusoidal data first used by Higdon (2002), 



y{x) 



sm 



V5) 



/Tlx 



1 / Attx 

+ - COS 



(6) 



where x G [0,9.6], capturing two periods of low fidelity oscillation (the sine term). We 
observe the response with noise Y{x) ~ N{y{x),a = 0.1). At this noise level it is difficult 
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to distinguish the high fidehty oscillations (the cosine term) from the noise without many 



samples. We used a T = 50 Latin hypercube design (LHD, e.g., Santner et al. , 2003, Section 
5.2.2) — ^just large enough to begin to detect the high fidelity structure. 

For PL we used = 1000 particles with an improper, scale-invariant prior (a, 6 = 0) 
for cr^. The particles were initialized at time = 5 via 10,000 MH rounds, saving every 
10**^. This took about 30 seconds in our R implementation on a 3GHz Athalon workstation. 
The remaining 45 PL updates with MH rejuvenation steps {0{t^)) following deterministic 
propagates took about 5 minutes: the first few {t < 10) took seconds, whereas the last 
few (t > 45) took tens of seconds. It is this fast between-round updating that we exploit 
for sequential design in Section |3} Foregoing rejuvenation (0(t^)) drastically reduces the 
computational demands for fixed A^, but larger N is needed to get a good fit due to particle 
depletion. 
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Figure 1: Predictive surface (s) for the sinusoidal data in terms of the posterior mean (black 
solid) and central 90% credible interval(s) (red dashed). Each particle is is represented on 
the left with three lines, and the average of the particles is on the right. 

The left panel of Figure [T] shows the point-wise predictive distribution for each of the 
1,000 particles in terms of the mean(s) and central 90% credible interval(s) of the Student-t 
distributions (3-4| with parameters yl^\ af"^^ and pf^ obtained from sf\ Their average, the 
posterior mean predictive surface, is shown on the right. Observe that some particles lead 
to higher fidelity surfaces (finding the cosine) than others (only finding the sine). Figure |2] 
shows the samples of the range [d) and nugget {g) obtained from the particles. Only 200 of 
the 1,000 are shown to reduce clutter. The clustering pattern of the black diamonds indicates 
a multimodal posterior. 

For contrast we also took 10,000 MCMC samples from the full data posterior, thinning 
every 10 and saving 1,000. This took about one minute on our workstation, which is faster 
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Samples of range (d) and nugget (g) 
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Figure 2: 200 samples of the range [d) and nugget {g) parameter obtained from particles 
(black diamonds) and from MCMC (red squares). 

than the full PL run, but much slower than the individual updates t — )■ t + 1. The marginal 
chains for d and g seemed to mix well (not shown) but, as Figure [2] shows [plotting last 200 
sample pairs as red squares] , the chain nevertheless became stuck in a mode of the posterior, 
and only explored a portion of the high density region. 

For a more numerical comparison we calculated the RMSE of predictive means (obtained 
via PL and MCMC, as above) to the truth on a random LHD of size 1000. This was repeated 
100 times, each with new LHD training (size 50, as above) and test sets. The mean (sd) 
RMSE was 0.00079 (0.00069) for PL, and 0.00098 (0.00075) for MCMC. As paired data, the 
average number of times PL had a lower RMSE than MCMC was 0.64, which is statistically 
significant {p = 5.837 x 10^^) using a standard one-sided t-test. In short, this means that 
the SMC/PL method is performing at least as well as the MCMC with quicker sequential 
updates. The MCMC could be re-tuned, restarted, and/or run for longer to narrow the 
RMSE gap, but all of these would come at greater computational expense. 

2.2 Classification 

Sufficient Information: In classification we use M CP priors on M x t latent variables. 
Therefore, St comprises of {K(^rn),t, ^(m),t,i^(m),t}m=i and yK 

Initialization: Particle initialization is identical to an M-fold application of regression 
CP particle initialization. As remarked in Section [lT| we must use a proper prior (a, b > 0) 
for each o"^^)- As a consequence, we may initialize all of the particles at to = by sampling 

{-^(m) o}rn=i identically from 7r(i^). There are no latent y at time zero, so neither they nor 
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{/3(m),0; V^(m),o}m=i required. It is also possible to initialize the particles at to > 0, which 
may be desirable in some situations. In this case, a hybrid of the MH scheme for regression 
GP's, applied M-fold, and a sampling of the latent 3^*° yielding {/9(m),to! ^(m),to}m=i5 
described below for the propagate step, works well. 

Resample: It may be helpful to think of the latent as playing the role of (hidden) states 
in a dynamic model. Indeed, their treatment in the PL update is similar. However, note 
that they do not satisfy any Markov property. The predictive density p{zt+i\St), which is 
needed for the resample step, is the probabihty of the label Ct+i{xt+i) under the sufficient 
information St'. p{ct+i{xt+i)\St). This depends upon the M latents y{xt+i), which are not 
part of St- For an arbitrary x, the law of total probability gives 

p{c{x)\St)= [ p{c{x),y{x)\St)dy{x)= [ p{c{x)\y{x))p{y{x)\St) dy{x). (7) 

The second equality comes since, conditional on y{x), the label does not depend on any 
other quantity ([s]). The M GP priors are independent, so piy{x)\St) decomposes as 

M 

p{y{x)\St) = Y[p{y{m){x)\Y(^m),t,K(^rn),t), (8) 
m=l 

where each component in the product is a Student-t density ([3]-[4]). 

The M-dimensional integral in Eq. ([T]) is not analytically tractable, but it is trivial to 
approximate by Monte Carlo as follows. Simulate many independent collections of samples 
from each of the M Student-t distributions (|8|: 

Y{x)^'^^ = {y(^rn){xY}fil, where y{rn){xY ''^ P{y{m){x)\Y(^m),t, K(^rn),t), (9) 

for £ = 1, . . . , L, say — thereby collecting M x L samples. Then pass these latents through 
the likelihood ([s]) and take an average: 

1 ^ 

p{c{x)\S,)^-J2pic{x)\Y{xf^). (10) 

e=i 

With as few as L = 100 samples this approximation is quite accurate. The weights {wi}fLi, 
where Wi oc p{ct+i{xt+i)\Sl:^^) , may be used to obtain the resampled indices {C{i)}fLi. Ob- 
serve that even L = 1 is possible, since we may then view Y{x)^^^ as an auxiliary member 
of the state Sf. So any inaccuracies in the approximation simply contribute to the Monte 
Carlo error of the PL method, which may be squashed with larger N. 

Propagate: The GP classification propagate step is essentially the aggregate of M regres- 
sion propagates. But these may only commence once the new latent (s) for t + 1 are incorpo- 
rated. We may extend the hidden state analogy to sample yft+i ~ p{y{xt+i)\St^''^) via the 
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independent Student-t distributions (|8|). In expectation we have that {Pf^) ^,'^'1^) t}rn= 



{C!,'^M,t}^=il3^*+''^^^{4i:),t}^=i since y^l'l ~ p{y{x,^,)\Sr), so no update of the 



7C(i) 



rest of the sufficient information is necessary at this juncture. To complete the propagation 
we must sample the full set of latents 37*+i'f(*) conditional upon ct+i via 2*+^. Once obtained, 
these fully propagated latents 37*+i'(*) may be used to update the remaining components of 
the sufficient information: {/^fl . , p ^^U , p i^S^.t , Jf 



^M,+P^E::),+P^S),m}"=il3^*^''^^^ comprising ^« . 



Sampling the latents may proceed via ARS, following Neal (1998). However, as in the 



regression setup, we prefer a more local move in the PL propagate context to compliment the 



globally-scoped resample step. So instead we follow Broderick and Gramacy (2010) in using 



10-fold randomly blocked MH-within-Gibbs sampling. This approach exploits a factorization 
of the posterior as the product of the class likelihood ([s]) given the underlying latents and 
their GP prior ([T]): (dropping the C(0) 



Picixj)\y^\xj)) X p(F*+;(x,)|r/+i(x. 



(m) 



r),^(-,-))- 



:iii 



Here, / is an element of a 10-fold (random) partition Xio of the indices 1, . . . ,t -|- 1, where 
|/| < 10 and — / = Xio\/ is its compliment. Extending the predictive equations from Section 
2.1, the latter term in Eq. (11) is an | J|-dimensional Student-t with Oj = | — /| — p — 1, 



mean vector 



and scale matrix 



(12) 



a + ui 



using the condensed notation Yj = Y(^rn){Xi), and |/| x |/'| matrix Kjji = K(^rn){Xi,Xi'), 
etc. A thus proposed Y^^-^^Xi) may be accepted according to the likelihood ratio since the 
prior and proposal densities cancel in the MH acceptance ratio. Let y'j denote the collection 
of M X (t + 1) latents comprised of Y(^^{Xj), Y^:^^^{Xj), and y'+^{X^j). Then the MH 
acceptance probability is min{l,y4} where 



A 



picix,)m 

p{C{x,)\yj^') 



n 



pic^\yr] 



Upon acceptance we replace Y/^j^) 



we loop over m 



withF('„)(X,) 



and otherwise do nothing. In this way 
M and / G Iio to obtain a set of fully propagated latents. 



An Illustration: Consider data generated by converting a real-valued output y{x) 



Xiexp{—x1 — x\) into classification labels (Broderick and Gramacy, 2010) by taking the 



sign of the sum of the eigenvalues of the Hessian of y{x). This gives a two-class process 
where the class is determined by the direction of concavity at x. For our illustration we take 
X G [—2,2]^, and create a third class from the first class (negative sign) where Xi > 0. We 
use M — 1 = 2 GPs, and take our data set to be T = 125 input-class pairs from a maximum 
entropy design (MED, 



Santner et al. 2003 



Section 6.2.1). Our N = 1000 particles are 
initialized using 10,000 MCMC rounds at time to = 17, thinning every 10. This takes less 
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class mean 
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Figure 3: Class posterior mean [left) and entropy posterior mean [right) for the PL fit to 
the 3-class 2-d exponential data. The classes are represented by three shades of gray; and 
the heat map for the entropy is hottest (whitest) for large values. The inputs are black open 
circles, and the miss-classified predictive locations are solid red circles. 



than 2 minutes in R on our workstation. Then we proceed with 108 PL updates, which takes 
about four hours. The first few updates take less than a minute, whereas the last few take 
7-8 minutes. 

Figure [3] shows the posterior predictive surface, interpolated from 1,000 MED test lo- 
cations, in terms of the most likely label from the mean posterior predictive [left), i.e., 
axgmaxrn N~^^f^^Pm{x) where pm{x) = p{c{x) = m)^*) ^ p{C{x) = m|S'f*^), and the 
mean entropy (right) of the label distribution — X]m=iP™ (^) ^°SPm (x). The 125 training 
inputs are shown as open black circles and the 76 misclassified test locations are shown as 
solid red ones. Observe that the predictive entropy is highest where determining the class 
label is most difficult: near the boundaries. 

The differences in Monte Carlo efficiency between PL and MCMC, here, are less stark. 
There is less scope for the posterior to be multimodal due to the role of the nugget. For 
classification, the nugget parameterizes the continuum between logit (small nugget) and 
probit (large nugget) models (Neal, 1998), which is a far more subtle than interpolation 
versus smoothing as in regression. In terms of computational complexity we can offer the 
following comparison. Obtaining 10,000 MCMC samples, thinning every 10, for the full 
T = 125 input-class pairs took about 45 minutes. While this is several times faster than PL 
on aggregate, observe that a single PL update for the 126**^ input-class pair can be performed 
several times faster than running a full MCMC from scratch. 

For a further comparison of timings on a larger classification problem we duplicated the 
10-fold cross validation (CV) experiment of Broderick and Gramacy (2010) on the two-class 
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credit approval data which has p = 47 covariates for 690 (x, c) pairs. The time required for 
the final PL update (t ~ 621) with N = 1000 particles, averaged over the 10 CV folds, was 38 
minutes. The resulting predictor(s) gave exactly the same misclassification error(s) averaging 
14.6% (4% sd) on the hold out sets as a similar estimator based on MCMC However, the 
authors reported that the MCMC took about 5.5 hours on average. So even with a modestly 
large design (fs 621), the Monte Carlo error that might accumulate with the use of vague 
priors in SMC does not seem to (yet) be an issue in our PL implementation. The savings in 
time is huge due the decomposition of far fewer 621 x 621 covariance matrices in the SMC 
framework. 



3 Sequential design 

Here, we illustrate how the online nature of PL is ideally suited to sequential design by AL. 
Probably the most straightforward AL algorithms in the regression context are ALM and 
ALC [see Section [l.l| . But these are well known to approximate space filling MEDs for 
stationary CP models. So instead we consider the sequential design problem of optimizing a 
noisy black box function. In the classification context we consider the sequential exploration 
of classification boundaries. 



3.1 Optimization by expected improvement 



Jones et al. ( 1998[ ) described how to optimize a deterministic black box function using a 
"surrogate" model (i.e., a CP with = 0) via the MLE (for {(i, /3, o"^}). The essence is 
as follows. After t samples are gathered, the current minimum is /min,t = niin{?/i, . . . ,yt}. 
The improvement at x is It{x) = maxj/min,* — Yt{x),0}, a random variable whose distri- 
bution is determined via Yt{x) = Y{x)\z^, Kt, which has a Student-t distribution ([3]-[4]). 
The expected improvement (EI) is obtained by analytically integrating out Yt{x). A branch 
and bound algorithm is then used to maximize the EI to obtain the next design point 
Xt+i = argmaxE{/t(s)}. The resulting iterative procedure (choose Xt+i] obtain yt+i{xt+i)] 
refit and repeat) is called the efficient global optimization (EGO) algorithm. 

The situation is more complicated when optimizing a noisy function, or with Bayesian 
inference via Monte Carlo. A re-definition of /mm,f accounts for the noisy {g > 0) responses: 
either as the first order statistic of Y{Xt) or as the minimum of the predictive mean surface, 
mmxi/tix). Now, each sample (e.g., each particle) from the posterior emits an EI. Using 

we have 



our Student-t predictive equations (3-4| for Sf^\ letting Sli'\x) = f\ 



(following Wilhams et al. , 2000): 



E{/,(x)|5«} 




x) + 



5?{xf 



a 



X 




(13) 



The EI is then approximated as E{/i(x)} ^ ^ Yl!i=i '^{h{x)\S[^'^}, thereby taking param- 
eter uncertainty into account. But the branch and bound algorithm no longer applies. 
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A remedy, proposed to ensure convergence in the optimization, involves pairing EI with 
a deterministic numerical optimizer. Taddy et al. (2009) proposed using a GP/EI based 
approach (with MCMC) as an oracle in a pattern search optimizer called APPS. This high 
powered combination offers convergence guarantees, but unfortunately requires a highly cus- 



tomized implementation that precludes its use in our illustrations. Gramacy and Taddy 



(2009, Section 3) propose a simpler, more widely applicable, variant via the opposite embed- 
ding. There are (as yet) no convergence guarantees for this heuristic, but it has been shown 
to perform well in many examples. 

Both methods work with a fresh set of random candidate locations Xt at each time 
t, e.g., a LHD. In the oracle approach, the candidate which gives the largest EI, = 
argmax-gjs^^ E{/t(5)}, is used to augment the search pattern used by the direct optimizer 
(APPS) to find Xt+i- In the simpler heuristic approach the candidate design is augmented 
to include the minimum mean predictive location based upon the MAP parameterization at 
time t. In our SMC /PL implementation this involves first finding i* = arg maxi=i^..._Ar p{Sj.^^ \ z*) 
and then finding = argmin^.^^* (R's optim function works well for the latter search 

when initialized with argmin^g^j^^ \^)-) We may then take Xt+i = arg max-^^^^^. E{/f (x)}, 
having searched both globally via Xf, and locally via x^. 



optim 



max log El 
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Figure 4: Tracking the progress of GP/EI optimization via PL. The left plot shows Xt (points) 
and (lines); the right plot shows logE{/f(a;t+i)}. 

Figure |4] illustrates the progress of this algorithm with PL inference on the 2-d exponential 
data [Section 2.2 , observed with A^(0, a = 0.001) noise. The = 1000 particles were 
initialized at time to = 7 with a LHD. Each Xt is a fresh size 40 LHD. The left panel 
tracks = (x*j,X2t), the optimal additional candidate, as lines and the chosen xt+i = 
{xi^t+i,X2,t+i) as points, both from t = to,---,T = 50. Observe how the points initially 
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explore to find a (local) optima, and then later make excursions (unsuccessfully) in search 
of an alternative. The right panel tracks the maximum of the log El, log{E,{It{xt+i)}) , 
from t = to, . . . ,T. Observe that this is decreasing except when xt+i 7^ x^, corresponding 
to an exploration event. The magnitude and frequency of these up-spikes decrease over 
time, giving a good empirical diagnostic of convergence. At the end we obtained = 
(-0.7119,0.0070), which is very close to the true minima x* = {-^/l/2,0). The 43 PL 
updates, with searches, etc., took about eleven minutes in R on our workstation. By way 
of comparison, the equivalent MCMC-based implementation (giving nearly identical results) 
took more than 45 minutes. 



3.2 Online learning of classification boundaries 



In Section 2^ [Figure [3] we saw how the predictive entropy could be useful as an AL heuristic 
for boundary exploration. Joshi et al. (2009) observed that when M > 2, the probability of 
the irrelevant class (es) near the boundary between two classes can influence the entropy, and 
thus the sequential design based upon it, in undesirable ways. They showed that restricting 
the entropy calculation to the two highest probabilities {hest-versus- second-best [BVSB] 
entropy) is a better heuristic. 




Figure 5: Class posterior mean {left) and entropy posterior mean {right) for the PL flt to the 
3-class 2-d exponential data by AL with the BVSB entropy heuristic, for comparison with 
the static design version in Figure |3] 

Figure |5] shows the sequential design obtained via PL with = 1000 particles and the 
BVSP entropy AL heuristic using a pre-deflned set of 300 MED candidate locations. The 
design was initialized with a to = 25 sub-MED (from the 300), and AL was performed at 
each of rounds t = to, . . . ,T = 125 on the 300 — t remaining candidates. This time there 
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are 40 misclassified points, compared to the 76 obtained with a static design [Section 2.2 



the same 1,000 MED test set was used]. The running time here is comparable to the static 
implementation. MCMC gives similar results but takes 4-5 times longer. 

Working off-grid, e.g., with a fresh set of LHD candidates in each AL round, is slightly 
more challenging because the predictive entropy is very greedy. Paradoxically, the highest 
(BVSB) entropy regions tend to be near the boundaries which have been most thoroughly 
explored — straddling it with a high concentration of points — even though the entropy rapidly 
decreases nearby. One possible remedy involves smoothing the entropy by a distance-based 
kernel (e.g., K{-, ■) from the GP) over the candidate locations. Applying this heuristic leads 
to very similar results as those reported in Figure [5| and so they are not shown here. 



4 Discussion 

We have shown how GP models, for regression and for classification, may be fit via the 
sequential Monte Carlo (SMC) method of particle learning (PL). We developed the relevant 
expressions, and provided illustrations on data from both contexts. Although SMC methods 
are typically applied to time series data, we argued that they are also well suited to scenarios 
where the data arrive online even when there is no time or dynamic component in the 
model. Examples include sequential design and optimization, where a significant aspect of 
the problem is to choose the next input and subsequently update the model fit. In these 
contexts, MCMC inference has reigned supreme. But MCMC is clearly ill-suited to online 
data acquisition, as it must be restarted when the new data arrive. We showed that the 
PL update of a particle approximation is thrifty by contrast, and that adding rejuvenation 
to the propagate steps mimicks the behavior of an ensemble without explicitly maintaining 
one. 

Another advantage of SMC methods is that they are "embarrassingly parallelizable" , 
since many of the relevant calculations on the particles may proceed independently of one 
another, up to having a unique computing node for each particle. In contrast, the Markov 
property of MCMC requires that the inferential steps, to a large extent, proceed in serial. 
Getting the most mileage out of our SMC/PL approach will require a careful asynchronous 
implementation. Observe that the posterior predictive distribution, and the propagate step, 
may be calculated for each particle in parallel. Resampling requires that the particles be 
synchronized, but this is fast once the particle predictive densities have been evaluated. Our 
implementation in the plgp package does not exploit this parallelism. However, it does make 
heavy use of R's lapply method, which automatically loops over the particles to calculate the 
predictive, and to propagate. A parallelized lapply, e.g., using snowfall and sf Cluster, 



as described by Knaus et al. (2009), may be a promising way forward. 
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